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Abstract 

This paper describes a CUDA implementation of Wagener's PRAM convex hull algorithm in 
^ BI2I . It is presented in Knuth's literate programming style. 



The source of this document is a .nw file (for 'noweb,' an implementation of Knuth's literate pro- 
O ■ gramming technique: see 'Literate programming with noweb,' by Andrew L. Johnson and Brad C. 

^ \ Johnson, Linux Journal, October 1st 1997). Noweb allows one to mix LaTeX with C (or pretty well 

any programming language), allowing a well-annotated program. One can extract 'chunks' from it. 

You need the noweb system, of course (that is, notangle to extract the C part and noweave to typeset 
rS ■ the full document). 

c3 ■ This document includes a Makefile. To start the ball rolling, you can extract it as follows: 



notangle -t8 -Rwagener . Makefile wagener.nw > wagener .Makefile 



With it, you can make a CUDA source file (wagener.cu) or a DVI copy of this document (make 
dvi produces wagener.dvi) 

There is one problem with wagener.cu. The construct <<<. . . »> is a necessary part of the cuda 
source code, and it conflicts with noweb 's construct <<...>>. Therefore wagener.cu contains 

match_and_merge LLL range, block RRR ( hood, newhood, scratch ); 



*e-mail: odunlain@maths.tcd.ie. Mathematics department website: |http://www.maths.tcd.ie[ 
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and it must be edited, changing LLL to <« and RRR to »>. 

{copyright) = 
/* 

* Copyright (C) 2010-12 Colm Dunlaing (odunlainSmaths .ted. ie) 
* 

* This file is free software: you can redistribute it and/or modify 

* it under the terms of the GNU General Public License as published 

* by the Free Software Foundation, either version 3 of the License, or 

* (at your option) any later version. 
* 

* This program is distributed in the hope that it will be useful, 

* but WITHOUT ANY WARRANTY; without even the implied warranty of 

* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 

* GNU General Public License for more details. 
* 

* You should have received a copy of the GNU General Pu blic License 



* along with this program. If not, see <| widthOpt |http : //www . gnu . org/licenses/ > 
*/ 



2 Wagener 's algorithm, CUDA version 

The present goal is to make a working CUDA version of Wagener' s PRAM algorithm for computing 
an upper hood for a set of n points presented in left-to-right order. 

We have not considered the memory access patterns which may seriously degrade performance. 
Again, thread divergence may degrade performance — it is an interesting exercise to write 'non- 
divergent' code. This has been done in some places and not in others. 

This program assumes that 

• n (the number of points) is a power of 2. 

• No three points are coUinear. 

• All x-coordinates are between and 1. We shall use the point REMOTE, (10, 0), for padding. 
Any point whose x-coordinate is > 1 is assumed to be 'remote,' used for padding. 

• There are no floating-point errors (i.e., it's a problem, but it's not our problem.) 

Also, three shared device arrays, one short and two f loat2 are used, of size n. Their total 
size is 18n bytes. This puts inessential limitations on n — there would be no difficulty, and little 
overhead, in slicing the data into manipulable chunks for larger n. 

(wagener) = 
(copyright) 
(globals) 

(match and merge) 
(main) 
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Figure 1: Points and hoods. The z-coordinates have been distorted in the depiction of host_hood. 

{globals)= 
#include <stdio.h> 
#include <stdlib.h> 
#include <cuda.h> 

float2 * point; 

int count ; 

float2 * host_hood; 

/* The following are device variables */ 
float2 * hood, * newhood; short * scratch; 

float2 REMOTE = i lO.Of , O.Of }; 

* make_remote () without memcopy 

device void niake_remote ( float2 * p ) 

{ 

p->x = 10. Of; p->y = O.Of; 
} 



Points are stored in the array point, and initially copied to host_hood. The main program 
launches the global routine match_and_merge repeatedly to merge adjacent hoods from inter- 
vals of size d to hoods of size 2d. 

The algorithm repeatedly copies host_hood [ ] to device array hood [ ] , launches 
match_and_merge ( ) , and copies the device array newhood [ ] to host_hood [ ] . 

Let s = log2 n; s is a positive integer. The hood is built in s — 1 stages (there is nothing to do if 
s = 1). At the r-th stage, let d = T: host_hood defines njd hoods. For < £ < njd, let P be 
the £-th block of d points from point (indexed from tdXa Id^ d — V). The £-th hood is H(P). The 
comers of H{P) are stored in the corresponding block of host_hood, shifted left and padded with 
copies of REMOTE (Figured). 

Next, n/2 match_and_merge threads are launched in n/{2d) blocks of dimension di x d2, 
where di = 2^'/^^ and ^2 = 2L''/^J, so d = did2- The £-th block of threads cooperate to compute 
H{P U Q), where P and Q are the 2£-th and 2£ + 1-st interval of d points, locating the common 
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tangent of H{P) and H{Q) and replacing these separate hoods hy H{P U Q), shifted and padded in 
a block of 2d entries in hood. 

The routine make_remote ( f loat2 *p) is used to set a point to remote values (I'm not sure 
how to assign a constant float 2 value in device code). 

{main) = 

int pos_power_of _2 ( int x ) 

if ( X < 2 ) 
return 0; 

while ( X > 1 ) 
if ( X % 2 == 1 ) 

return 0; 
else if ( X == 2 ) 

return 1 ; 
else 
X /= 2; 
} 

void show_current_hoods ( FILE * outfile, int d ) 
i 

int i, j, hoodsize; 

fprintf (outfile, ""/odXn", count/d) ; 

for ( i=0; i<count/d; ++i ) 

{ 

hoodsize = 0; 

for ( j=0; j < d; ++j ) 

if ( host_hood[i*d+j] .X <= 1.0 ) 
++ hoodsize; 
fprintf (outfile, "°/od\n" , hoodsize) ; 
for ( j=0; j<d; ++j ) 

if ( host_hood[i*d+j] .X <= 1.0 ) 

fprintf (outfile, "y„f Zf \n" , host_hood[i*d+j] .x, 
host_hood[i*d+j] .y) ; 
} 

fprintf (outfile, "\n") ; 
} 

main( int argc, char * argv[] ) 
{ 

int i; 

int d, dl, d2; 

FILE * file; 

FILE * trace; 

short * h_scratch; 

count = 0; 

if ( argc != 2 && argc != 3 ) 
{ 

fprintf (stderr , 

"usage: °/oS <sorted counted points file> [<trace file>]\n". 
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argv [0] ) ; 
exit(-l) ; 
} 

The program copies the points to standard output, computes the hood and writes the hood 
points to standard output. It may write comment lines beginning #. If the trace file is used, it prints 
the intermediate hood sequences to this file. The program output is intended to be sent to a companion 
program hoodlps which generates postscript. 

(mam)+= 

file = fopen ( argv[l] , "r" ); 

if ( file == NULL ) 

{ 

fprintf (stderr, "Zs unreadable\n" , argv[l]); 

exit(-l) ; 
} 

trace = NULL; 
if ( argc == 3 ) 
{ 

trace = fopen ( argv [2], "w" ); 

if ( trace == NULL ) 

fprintf(stderr, "Can't write to Zs, no tracing\n" , argv[2]); 
} 

fscanf (file,"yod", &count) ; 

if ( ! pos_power_of _2 ( count ) ) 

{ 

fprintf (stderr, "Count °/od not a power of 2, abort\n" , count); 

exit(-l) ; 
} 

printf ("%d\n", count); 

point = (float2*) malloc (count * sizeof (f loat2) ); 
host_hood = (float2*) malloc (count * sizeof (float2) ); 
h_scratcli = (short*) malloc ( count * sizeof ( short ) ) ; 

for (i=0; Kcount; ++i) 
{ 

fscanf(file, "%f %f", & (point [i] .x) , &(point [i] .y)) ; 

printf("yof "/of \n" , point[i].x, point[i].y); 

host_hood[i] = point [i] ; 
} 
printf ("\n"); 

dl = 2; 
d2 = 1; 
d = dl * d2; 

hood = newhood = NULL; 
scratch = NULL; 
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The array point will contain the data points, and host_hood will contain the interme- 
diate hoods as illustrated in Figure [1] H_s cratch is to hold a copy, on the host, of the device array 
scratch, for debugging. The shared device arrays hood, newhood, scratch are allocated 
at every thread launch. Also, host_hood needs to be copied to hood before the thread launch. 

(mam)+= 

while ( d < count ) 
{ 

if ( trace != NULL ) 

show_current_hoods ( trace, d ); 
if ( hood != NULL ) 
{ 

cudaFree ( hood ) ; 
cudaFree ( newhood ) ; 
cudaFree (scratch) ; 
} 

cudaMallocC (void **) & hood, count * sizeof( float2 )); 
cudaMemcpy( hood, host_hood, count * sizeof (f loat2) , 
cudaMemcpyHostToDevice) ; 

cudaMalloc( (void **) & newhood, sizeof ( float2 ) * count ); 
cudaMalloc( (void **) & scratch, count * sizeof (short) ); 

Now THE THREAD LAUNCH: n threads in n/{2d) blocks of dimension di x 62- 

{main)+= 

/* 

* LLL and RRR need to be replaced 

* by triple < and >: double < and > 

* have a special meaning in noweb, 

* the literate programming system 

* we use. 
*/ 

dimS range ( count / (2*d) ) ; 

dim3 block ( dl , d2 ) ; 

match_and_merge LLL range, block RRR ( hood, newhood, scratch ); 

When all threads have terminated, copy the revised array newhood to host_hood, and 
print various debugging items. 

{main)+= 

cudaMemcpy(host_hood, newhood, count * sizeof (float2) , 
cudaMemcpyDeviceToHost) ; 

printf ("#returned from match_and_merge, dl=y„d, d2=°/od, d=yod\n" , 
dl, d2, d); 

cudaMemcpy(h_scratch, scratch, count * sizeof (short), 
CudaMemcpyDeviceToHost) ; 
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{main)+= 

printf ("#scratch:\n#") ; 
for(i=0; Kcount; ++i ) 
{ 

printf ("y„3d ", h_scratch[i] ) ; 
if ( i > && i % 10 == ) 
printf ("\n#"); 
} 
printf ("\n"); 



} 



if ( dl > d2 ) 

62 *= 2; 
else 

dl *= 2; 
d = dl * d2; 



The following is for debugging. 

(mam)+= 

cudaMemcpy(host_hood, newhood, count * sizeof (f loat2) 
cudaMemcpyDeviceToHost) ; 

printf ("#newhood contents\n") ; 
for (i=0; Kcount; ++i) 

printf ("#yof °/of\n", host_hood[i] .x, host_hood[i] .y) ; 

if ( trace != NULL ) 
{ 

f printf (trace, "0\n") ; 

f close ( trace ) ; 
} 

show_current_hoods ( stdout , count ) ; 

return ; 



The remaining functions are on the device. 
The function lef t_of ( ) returns 1 if r is left 
of the directed line- segment pq, 
(i.e., det(g — p,r — p) > 0), otherwise. 




q 
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{match and merge) = 

* left.of 

device int left_of ( float2 r, float2 p, float2 q ) 

float value; 

value = 

(q.x - p.x) * (r.y - p.y) - (q.y - p.y) * ( r.x - p.x ); 

return ( value > ) ; 



Suppose P and Q are adjacent intervals of points processed by a thread block in mat ch_and_merge. 
Given two points p and g g is either a comer of H(Q) or is remote, and p is to the left of Q, there is 
a unique tangent to H(Q) from P: suppose q' is the comer of H{Q) which supports the tangent. Let 
f{p, q) be LOW, EQUAL, or HIGH according as q is left of, at, or right of q' (high if g is remote). 

Similarly Up is remote or on H{P) and q is to the right of P, a function /(p, q) indicates whether 
p is left of, at, or right of the point supporting the tangent to H{P) from q (or remote). 

These functions are implemented (on the device) by g and f below, where p = hood [ i ] and q = 
hood [ j ] and P is defined by the range start . . start+d-l,(5by start+d. . start + 2*d-l. 



f(p ,q ) 




LOW 




EQUAL 




HIGH 



8(p ^q ) 

{match and merge) +^ 

#define LOW -1 
#define EQUAL 
#define HIGH 1 




LOW 




EQUAL 




HIGH 



device short g( float2 * hood, short i, short j 

short start, short d ) 
{ 

float2 p, q, q_next, q_prev; 

int atstart, atend; 

int isleft; 



if ( hood [j] . X > 1 ) /* REMOTE */ 
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return HIGH; 

p = hood[i] ; 
q = hood[j] ; 

atend = ( j == start + 2*d - 1 I I hood[j+l] .x > 1.0 ); 

Atend signals the condition that q is the rightmost comer of H{Q). As written, it might cause 
thread divergence, which could be remedied by adding an extra slot in hood and making it REMOTE . 
Using atend, we can (without divergence) make q_next default to a point directly underneath the 
righmost comer in H(Q), in the case where q is the last comer in H{Q). 

If q_next is left of pq, then q is LOW. 

{match and merge) += 

q_next = hood [ j+1-atend ] ; 
q_next.y -= (float) atend; 

if ( left_of ( q_next, p, q ) ) 
/* 
* avoidable divergence? 
*/ 
return LOW; 

Similarly atstart indicates whether q is leftmost in H{Q), in which case q_prev is directly 
below it; otherwise it is the comer of H(Q) to its left; g is HIGH iff q_prev is left of the directed 
line-segment pq. 

{match and merge) += 

atstart = ( j == start + d ) ; 
q_prev = hood[ j + atstart - 1 ] ; 
q_prev.y -= (float) atstart; 

isleft = left_of ( q_prev, p, q ); 

return HIGH * isleft + EQUAL * (1-isleft) ; 



* f ( i, j , start, d ) 

device short f( float2 * hood, short i, short j, 

short start, short d ) 
{ 

float2 p, q, p_next, p_prev; 

int atstart, atend; 

int isleft; 
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if ( hood [i] . X > 1 ) /* REMOTE */ 
return HIGH; 

p = hood[i] ; 
q = hoodEj] ; 

atend = ( i == start + d - 1 I I hood[i+l] .x > 1 ); 

p_next = hood [ i+1-atend ] ; 
p_next.y -= (float) atend; 

if ( left_of ( p_next, p, q ) ) 
return LOW; 

atstart = ( i == start ) ; 

p_prev = hood[ i + atstart - 1 ] ; 

p_prev.y -= (float) atstart; 

isleft = left_of ( p_prev, p, q ); 

return HIGH * isleft + EQUAL * (1-islef t) ; 
} 

The workhorse of Wagener's algorithm is the 
mat ch_and_merge procedure below. Recall that n/{2d) 
threads are launched in blocks of dimension di x ^2. 
The £-th block is to calculate H{P U Q), where P and 
Q are intervals of d points in hood beginning at 2di 
(this offset is computed and stored in start). First 
start and other parameters are computed, and the 
scratch array is set to a recognisably 'uninitialised' value. 
(scratch [start . . start + 2*d-l] is shared by the 
threads in the same block). 

The main effort is calculating the comers of H{P) and H{Q) supporting the common tangent. 
Their indices will be placed in pindex and qindex, initially —1 to show uninitialised. 

There are di sample points along H(P) and ^2 along H(Q), but some of them will be REMOTE. 

Match_AND_merge begins by setting the variables di,d2,d, start, x,y,, indx to mirror 
the construction of its thread blocks. Also, pindex, qindex, scratch are set to negative 
values, meaning not initialised. Also i and j are set to sample comers (indices) in H(P) and H(Q). 
There are di sample indices i and ^2 sample indices j. If I is the set of sample indices i, namely, 
/ = {start + d2X : < x < di}, and correspondingly J = {start + d + diy : < y < ^2}, 
then the procedure is outlined as follows. 

For < X < di, let i^ = start + d2X, so I = {i^ : < x < di}. Also, for < y < d2, let 
jy = start + d + diy, so J = {jy : < y < ^2}- 

{match and merge) += 
(mam 0: intialisations) 

Imam 1: 0<=x<dl scratch[start+x]=max jy g(ix,jy) <= EQ) 
(mam 2: 0<=x<dl scratch[start+d+x]=jl(x)=uniquejg(ix,j) =EQ) 
[mam 3: scratch[ start] =kO=max ixf(ix,jl(x)) <= EQ) 




Figure 2: thread allocation. 
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(mam 4: 0<=y<d2 scratch[start+d+y]=ly=maxjx, 0<=x<d2, g(kO+y,jx)<=EQ) 
Imam 5: scratch[start..]= unique p=kO+y,q=ly+x, g(p,q)=f(p,q)=EQ) 
(mam 6: newhoodf start.. ]= hood[start..p] catenated hood[q..start+d-l ]) 



{mam 0: intialisations)= 

* match_and_merge () 

global void match_and_merge ( float2 * hood, float2 * newhood, 

short * scratch ) 
{ 

int i, j, pindex, qindex, shift; 

int dl, d2, d, start, x, y, indx; 

dl = blockDim.x; 

d2 = blockDim.y; 

d = dl * d2; 

start = blockldx.x * 2 * d; 

X = threadldx.x; 
y = threadldx.y; 
indx = x + dl * y; 

pindex = qindex = -1; 

scratch [ start + indx ] = -1; 
scratch [ start + indx + d ] = -1; 

syncthreadsO ; 

i = start + d2 * x; 

{mam 1: 0<=x<dl scratch[start+x]=maxjy g(ix,jy) <= EQ) = 
if ( hood[i].x <= 1.0 ) /* not REMOTE */ 
{ 

j = start + d + dl * y; 

/* 

* The condition below should identify the 

* unique interval of H(Q) touching the 

* tangent from hood[i] . 
*/ 

if ( g (hood, i,j, start, d) <= EQUAL && 
( y == d2 - 1 II 
hood[j+dl] .X > 1.0 II 
g(hood,i,j+dl, start, d) == HIGH ) 
) 
scratch [ start+x ] = j; 
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syncthreadsO ; 

{mam 2: 0<=x<dl scratch[start+d+x]=jl(x)=uniquejg(ix,j) =EQ) = 
if ( hood[i] .X <= 1.0 ) 
{ 

j = scratch [start + x] + y; 

if ( g (hood, i, j , start, d) == EQUAL ) 

scratch [start + d + x] = j ; 
else if ( d2 < dl && g (hood, i,j+d2, start ,d) == EQUAL ) 
scratch [start + d + x] = j+d2; 
} 

syncthreadsO ; 

Suppose that p and q are the actual comers to be calculated, supporting the common tangent 
to H{P) and H{Q). For each sample point pi a corresponding tangent comer q'^ on H(Q) has been 
calculated. 




(2.1) Theorem The tangent corners q[ occur in nondecreasing left-to-right order, and pi is left of 
equal to, or right ofp according as f{pi, q[) is LOW, EQUAL, or HIGH. 

Sketch proof. Parametrise the tangents to H{Q) by the angle 9 they make with the x-axis: 6 
varies over the clockwise interval from 90° (yielding the left vertical tangent) to —90°. 

For each 9, let Lg be the half-plane left of the tangent 
line at angle 6 (except at ±90°, this means above the tan- 
gent line). The map 9 ^ Lg is, loosely speaking, con- 
tinuous, and H(P) fl Lq contracts with 9. The point of 
contact between Lg and H{Q) shifts discontinuously from 
comer to comer, but always rightward. At a unique angle, 
9 = a, say, the intersection contains a single point, and 
that point is p. The points pi under consideration are left 
and right endpoints of various sets H{P) fl Lg, the points 
q'i are points of contact between various Lg and H{Q), and 
the points Pi are left of, at, or right of p according to the values of f{pi, q'j). | 

{mam 3: scratch[start]=kO=max ixf(ix,jl(x)) <= EQ) = 
j = scratch [start +d+x] ; 
if ( hood[i] .X <= 1.0 && 

f (hood, i,j , start, d) <= EQUAL && 
( X == dl-1 I I 
hood[i+d2] .x > 1.0 II 

f (hood, i+d2, scratch [start + d + x + 1], start, d) == HIGH 
) 
) 
scratch [start] = i; 



Figure 3: Lg 



syncthreadsO ; 
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{mam 4: 0<=y<d2 scratch[start+d+y]=ly=maxjx, 0<=x<d2, g(kO+y,jx)<=EQ) = 
i = scratch [start] + y; 

if ( hood[i].x <= 1.0 ) /* not REMOTE */ 
{ 

j = start + d + X * d2; 
if ( g (hood, i, j , start, d) <= EQUAL && 
( X == dl - 1 II 
hood[j+d2] .X > 1.0 II 
g (hood, i,j+d2, start, d) == HIGH ) 
) 

scratch [start + d + y] = j; 
} 

syncthreadsO ; 

{mam 5: scratch[ start.. ]= unique p=kO+y,q=ly+x, g(p,q)=f(p,q)=EQ) = 
j = scratch [ start + d + y ] + x; 

if ( X < d2 && 

g (hood, i, j , start, d) == EQUAL 
&& 

f (hood, i,j, start, d) == EQUAL ) 
{ 

scratch [start ] = i; 

scratch [start +1] = j ; 
} 

syncthreadsO ; 

{mam 6: newhood[ start.. ]= hood[start..p] catenated hood[q..start+d-l ])= 
pindex = scratch [ start ] ; 

qindex = scratch [ start + 1 ] ; 

newhood [ start + indx ] = hood[ start + indx ]; 
make_remote ( & ( newhood [ start + d + indx ] ) ) ; 
syncthreadsO ; 

Let s be the 'shift', qindex-pindex-1 . 
Then hood [qindex . . . start + 2*d-l ] is copied, shifted left by s, to 

newhood [pindex+1 . . . ] . 

{mam 6: newhood[ start.. ]= hood[start..p] catenated hood[q..start+d-l ])+= 
shift = qindex - pindex - 1; 

if ( start + d + indx >= qindex ) 

newhood [ start + d + indx - shift ] = hood [ start + d + indx ] ; 

syncthreadsO ; 
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Figure 4: Sample cuda output, 1024 points 



Final closing brace in match and merge. 
{match and merge) +^ 
} 

(wagener.Makefile) = 

.SUFFIXES: .nw .tex .c 

wagener : wagener . nw 

/usr/bin/notangle -Rwagener -L wagener.nw > wagener. cu 

dvi : wagener.nw 

/usr/bin/noweave -delay wagener.nw > wagener.tex 

latex wagener 

latex wagener 

rm wagener . out wagener . aux 



clean: 



rm * . c * . dvi * . log 



3 Conclusions 

Wagener's PRAM algorithm, published only as a manuscript, is very clean and simple in comparison, 
for example, with another 0{\ogn) algorithm in [[I]. 

Our program illustrates how Wagener's PRAM algorithm might be realised on a CUDA chip: 
the organisation, at any rate, is faithful to the model. However, it is insensitive to the memory bank 
conflicts which make the chip, although robust enough to tolerate these conflicts, so slow that the 
parallel program is slow by comparison with another serial program (not described here). 
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On the other hand, we tried to avoid branching, another reason for serialisation, and the writing 
of branch-free code is an interesting challenge. 

Another possible innovation was our usage of padding, rather than compression, which we felt 
too cumbersome. That is, data would be in blocks, with 'live' data to the left of the block padded 
with 'remote' values on the right. This left some threads with nothing to do, but it avoided allocation 
tasks. 

A few last words about optimal speedup. Our algorithm gets the data points in sorted order, and in 
principle should use 0(ri) work (runtime x processor count): but it uses 0{n\ogn). We indicate how 
Wagener's algorithm can achieve optimal speedup: O(logn) time and 0{n) work. So we suppose we 
have n data points and ra/(log2 n) processors. 

• Separate the data into n/ log n strips, 1 per processor, and compute the convex hood in each 
strip, O(logra) time serially. 

• Store the hood comers in each strip (in left-to-right order) in balanced trees of size < log n. 

• Overmars and Van Leeuwen devised a logarithmic time procedure, a balanced search, for locat- 
ing common tangents: see [I1I2L Applying their procedure to convex hoods stored in balanced 
trees, convex hoods can be merged in logarithmic time. 

• This means that with log log n passes using < n/ log n processors per pass, convex hoods can 
be calculated for n/ log^ n strips each containing log^ n points, each in time O (log log n), hence 
0((loglogr2)^) overall, which is of course O(logn). 

• Under the PRAM model, these trees can be flattened into arrays using log n processors per tree. 
Now we have the same organisation as in our Cuda algorithm, with strips of log^ n points each 
stored in an array. 

• Our implementation involved finding the common tangent between adjacent hoods using k 
processors for hoods of size (at most) k, in 0(1) time. 

Given k > log^n, this can be done with k/\ogn processors. In this case there are at least 
y/k processors available. Let h = \fk, and let P be the points in the left-hand strip and Q the 
points on the right. Subdivide -ff (-P) into kjh intervals of length h. For each interval endpoint 
p, allocate h processors which first inspect intervals in H{ff) of length kjh, bracketing the 
tangent from -p to one of these intervals; next they bracket the tangent to an interval of length 
fc//i^, then kjYc' , and finally return the tangent from p to H{Q). This brackets the common 
tangent endpoint in H{P) to an interval of length k/h; repeat the process to bracket to intervals 
of length k/h'^ and k/h^, and finally compute the common tangent. 

When run on the dataset illustrated, our CUDA algorithm is perceptibly slower by comparison 
with a serial algorithm (which is not described here). This is not surprising considering the serialisa- 
tion of conflicting memory accesses. To attempt optimal speedup as described here would demand a 
great deal of effort. Our CUDA program is a specimen implementation of a PRAM algorithm which 
cannot claim much speed advantage. 



March 1, 2013 wagener.nw 16 

4 References 

1. 'Parallel Computational Geometry,' with Alok Aggarwal, Bernard Chazelle, Leo Guibas, and 
Chee-Keng Yap (1988). Algorithmica 3, 293-327 (special issue on Parallel Processing). 

2. 'Some parallel geometric algorithms' (1993), in Lectures on Parallel Computation, ed. Alan 
Gibbons and Paul Spirakis, Cambridge University Press, 77-108. 

3. H. Wagener (1985). Optimally parallel algorithms for convex hull determination. Manuscript, 
Technical University of Berlin. 



